ABSTRACT

HEXEvent (http://hexevent.mmg.uci.edu) is a new database that
permits the user to compile genome-wide exon data sets of human
internal exons showing selected splicing events. User queries can
be customized based on the type and the frequency of alternative
splicing events. For each splice version of an exon, an EST count
is given, specifying the frequency of the event. A user-specific
definition of constitutive exons can be entered to designate the
exon exclusion level that is still acceptable for an exon to be
considered constitutive. Similarly, the user has the option to
define a maximum inclusion level for an exon to be called an
alternatively spliced exon. Unlike other existing splicing
databases, HEXEvent permits the user to easily extract alternative
splicing information for individual, multiple or genome-wide human
internal exons. Importantly, the generated data sets are
downloadable for further analysis.

INTRODUCTION 

The process of pre-mRNA splicing is essential for the ex- 
pression of most metazoan genes. It is carried out by the 
spliceosome that catalyzes the removal of non-coding 
intronic sequences and concatenates remaining exons to 
form mature mRNAs (1). Of the approximately 25 000 
genes encoded by the human genome (2), >90% are 
believed to produce transcripts that are alternatively 
spliced (3,4). The process of alternative splicing results in 
the production of multiple mRNA isoforms from a single 
pre-mRNA and thereby significantly enriches the prote- 
omic diversity of higher eukaryotic organisms. Major 
splice events include exon skipping (also referred to as 
cassette exon events), exons with alternative 3'- and/or 
5'-splice sites, intron retention, mutually exclusive exons, 
as well as alternative first and last exons. Given the com- 
plexity of higher eukaryotic genes and the relatively low 
conservation of splice sites, the precision of the splicing
machinery is impressive. Defects in splicing lead to many
human genetic diseases (5-7) and splicing mutations in a 
number of genes involved in growth control have been 
implicated in multiple types of cancer (8-12). 

The vast majority of alternative splicing events are 
biased toward high (>80%) or low (<20%) inclusion 
levels (Figure 1). This distribution is not only observed 
for alternative splicing events defined by Expressed
Sequence Tags (ESTs), but also for alternative splicing
events derived from deep sequencing data (13). 
Interestingly, alternatively spliced exons with high inclu- 
sion levels (>90%) display physical characteristics indis- 
tinguishable from constitutive exons (13), even when using 
machine learning techniques (our unpublished data). 
Thus, an exon that is associated with an extremely rare 
alternative splicing event may behave much more like a 
constitutive exon. These observations, in combination 
with the demonstration that even constitutive exons 
display low levels of alternative splicing (4), challenge 
the definition of constitutive and alternative exons. Yet, 
all genome-wide analyses of alternative splicing carried 
out in the past have relied on comparing sets of alterna- 
tively spliced exons with constitutive exons based on 
simple yes/no decisions, thus potentially introducing 
large errors. HEXEvent permits the user to define the in- 
clusion level required for an exon to be considered as al- 
ternatively or constitutively spliced. This definition can 
range from the strictest to more relaxed constitutive 
splicing interpretations. 

To carry out large-scale or genome-wide analyses on 
exons of a certain type, extensive exon data sets of that 
alternative splicing type are needed. ASPicDB (14) is a 
recently published alternative splicing database that 
reports a list of exons within a certain region or gene. 
However, this list is not made available for download. 
While the University of California, Santa Cruz (UCSC) 
Genome Browser offers sets of all alternatively spliced 
exons for download (15), those sets are missing three im- 
portant pieces of information: (i) no set of constitutively 
spliced exons is available; (ii) inclusion and/or usage levels 
are not assigned to exons or splice sites; and (iii) all sets 
have a splice event centric view, meaning they list all exons 
that show a certain splicing event, but do not list all
splicing events for an individual exon. Finally, neither
ASPicDB nor the UCSC Genome Browser allows
user-specific definitions for constitutive and alternative
exons. In contrast, HEXEvent allows the user to tailor
splicing categorization based on exon inclusion levels. In
addition, the queried data are downloadable for further
analysis regardless of whether they cover individual or
genome-wide sets of internal human exons. Finally, all
known alternative splicing events per exon are reported
in the output file.

Figure 1. Cassette exon inclusion levels determined from HEXEvent.
The plot shows the relationship between exon inclusion levels and the
cumulative number of events.



DATABASE GENERATION, CONTENT AND 
DEFINITIONS 

HEXEvent is a new database, which assists the user in 
compiling genome-wide exon data sets. Its exon informa- 
tion is based on known mRNA isoforms as defined by the 
UCSC Genome Browser (GRCh37/hg19) (15) as well as
available EST information. For each internal exon in the 
human genome, the number of EST events that include or 
exclude the queried exon or an alternative version of it was 
computed. Based on this information, inclusion/exclusion 
levels of each exon as well as usage frequencies of alter- 
native splice sites are defined. The major splice version of 
each exon was defined, whereas all minor splice variants 
are indicated as alternative splicing events of that exon. 
HEXEvent includes information about exon skipping/in- 
clusion and alternative 3' and/or 5'-splice sites. At this 
stage, we do not include information on intron retention 
events because HEXEvent is currently based on EST 
information. Due to the short length of the ESTs, intron 
retention information would be biased toward short 
retained introns. 

If a new splice variant was found only in ESTs but was not
included in the UCSC isoform list, we accepted it as a genuine
new splice alternative (indicated as 'onlyEST' instead of a gene
name), provided that it showed the canonical splice sites (GU at
the 5'-splice site, AG at the 3'-splice site). Although this
splice site requirement might reduce the identification of new
alternative splicing events associated with the minor
spliceosome, it significantly reduces the number of false
positives. Seemingly alternative versions of an exon whose
annotated splice sites differ by only 1 or 2 nt were combined
into a single version of the exon because the addition of one or
two nucleotides rarely defines an alternative splice site (16).
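
The two acceptance rules above can be illustrated with a minimal Python sketch. This is not the code used to build HEXEvent: the function names, the (start, end) tuple representation of an exon version and the reading that each splice site may differ by at most 2 nt are assumptions made for illustration. Sequences are written as DNA, so the RNA dinucleotide GU appears as GT.

```python
def has_canonical_splice_sites(upstream_intron_end: str,
                               downstream_intron_start: str) -> bool:
    """Check for AG at the 3'-splice site and GT (GU in RNA) at the 5'-splice site.

    `upstream_intron_end` is the intronic sequence directly before the exon,
    `downstream_intron_start` the intronic sequence directly after it
    (both hypothetical inputs, given 5'->3' on the transcribed strand).
    """
    return (upstream_intron_end.upper().endswith("AG")
            and downstream_intron_start.upper().startswith("GT"))


def same_exon_version(a: tuple, b: tuple, tolerance: int = 2) -> bool:
    """Treat two (start, end) exon versions as one version if each
    boundary differs by at most `tolerance` nucleotides (assumed rule)."""
    return abs(a[0] - b[0]) <= tolerance and abs(a[1] - b[1]) <= tolerance


# An EST-only exon is accepted only with canonical flanking dinucleotides.
accepted = has_canonical_splice_sites("ttttctcag", "gtaagtcc")
# Boundaries 2 nt apart are merged; boundaries 3 nt apart are kept separate.
merged = same_exon_version((100, 200), (102, 200))
separate = not same_exon_version((100, 200), (103, 200))
```

In this sketch a version that fails either test would be discarded or kept as a distinct alternative splice site, respectively.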

For each alternative splice version of an exon, the 
number of supporting ESTs is given. For cassette exons, 
the number of ESTs including and not including the exon 
is specified. For exons with one or more alternative splice
sites, the locations of the alternative sites are reported as
well as the number of ESTs supporting them.

The basic EST counts as defined in Table 1 are represented as
follows:

• $c_{\mathrm{count}}$: the number of ESTs including the exon with
the major coordinates;

• $c_{\mathrm{alt3}}$: the number of ESTs including the exon with an
alternative 3'-splice site;

• $c_{\mathrm{alt5}}$: the number of ESTs including the exon with an
alternative 5'-splice site;

• $c_{\mathrm{alt3+5}}$: the number of ESTs including the exon with
both an alternative 3'-splice site and an alternative 5'-splice
site; and

• $c_{\mathrm{skip}}$: the number of ESTs excluding the exon.

Based on these numbers, several values are calculated for each
exon.

• The constitutive level evaluates the inclusion of the major
version of the exon and compares it to the number of occurrences
of any alternative event. The constitutive level is defined as:

$$\mathrm{constitLevel} = \frac{c_{\mathrm{count}}}{c_{\mathrm{count}} + c_{\mathrm{alt}}}$$

The alternative count $c_{\mathrm{alt}}$ includes all counts of
possible alternative events:

$$c_{\mathrm{alt}} = c_{\mathrm{alt3}} + c_{\mathrm{alt5}} + c_{\mathrm{alt3+5}} + c_{\mathrm{skip}}$$

• The inclusion level compares the presence of the exon and any
alternative version of it with its exclusion. The inclusion level
is defined as:

$$\mathrm{inclLevel} = \frac{c_{\mathrm{incl}}}{c_{\mathrm{incl}} + c_{\mathrm{excl}}}$$

The inclusion count $c_{\mathrm{incl}}$ equals the sum of ESTs
showing the exon with major coordinates $c_{\mathrm{count}}$ plus
the number of ESTs showing the exon with an alternative 3'-splice
site $c_{\mathrm{alt3}}$, with an alternative 5'-splice site
$c_{\mathrm{alt5}}$ or with an alternative 3'- and 5'-splice site
$c_{\mathrm{alt3+5}}$ (see Table 1):

$$c_{\mathrm{incl}} = c_{\mathrm{count}} + c_{\mathrm{alt3}} + c_{\mathrm{alt5}} + c_{\mathrm{alt3+5}}$$

The exclusion count $c_{\mathrm{excl}} = c_{\mathrm{skip}}$ equals
the number of ESTs supporting exon skipping.

• The usage level of the 3'-splice site represents the usage ratio
of the major 3'-splice site and any used 3'-splice site. It is
defined as:

$$\mathrm{3usageLevel} = \frac{c_{\mathrm{major3}}}{c_{\mathrm{major3}} + c_{\mathrm{other3}}}$$

The number of ESTs showing the usage of the major 3'-splice site
$c_{\mathrm{major3}}$ equals the sum of ESTs including the major
version of the exon $c_{\mathrm{count}}$ and the number of ESTs
including the exon with an alternative 5'-splice site but the same
3'-splice site $c_{\mathrm{alt5}}$:

$$c_{\mathrm{major3}} = c_{\mathrm{count}} + c_{\mathrm{alt5}}$$

In contrast, the number of ESTs not showing the exon with its major
3'-splice site $c_{\mathrm{other3}}$ equals the sum of ESTs that
show the exon either only with an alternative 3'-splice site
$c_{\mathrm{alt3}}$ or with mutually occurring 3'- and 5'-splice
sites $c_{\mathrm{alt3+5}}$:

$$c_{\mathrm{other3}} = c_{\mathrm{alt3}} + c_{\mathrm{alt3+5}}$$

• The usage level of the 5'-splice site is defined analogously to
the usage level of the 3'-splice site as:

$$\mathrm{5usageLevel} = \frac{c_{\mathrm{major5}}}{c_{\mathrm{major5}} + c_{\mathrm{other5}}}$$
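
As a worked check of these definitions, the following Python sketch recomputes the four levels from the five EST counts and applies them to the example exon of Table 1 (count = 15, alt3 = 10, alt5 = 0, alt3+5 = 2, skip = 0). The class and function names are hypothetical, and returning NaN for an exon without any supporting EST is an assumption, since the text does not state how that case is handled.

```python
from dataclasses import dataclass


@dataclass
class ExonCounts:
    """EST counts for one exon (field names follow Table 1, columns 5-9)."""
    count: int      # ESTs including the exon with the major coordinates
    alt3: int       # ESTs with an alternative 3'-splice site
    alt5: int       # ESTs with an alternative 5'-splice site
    alt3and5: int   # ESTs with both alternative splice sites
    skip: int       # ESTs skipping the exon


def _ratio(numerator: int, denominator: int) -> float:
    # Assumption: an exon without supporting ESTs yields NaN.
    return numerator / denominator if denominator else float("nan")


def levels(c: ExonCounts) -> dict:
    """Compute the constitutive, inclusion and splice site usage levels."""
    included = c.count + c.alt3 + c.alt5 + c.alt3and5    # c_incl
    total = included + c.skip                            # c_incl + c_excl
    return {
        "constitLevel": _ratio(c.count, total),          # c_count / (c_count + c_alt)
        "inclLevel": _ratio(included, total),
        "3usageLevel": _ratio(c.count + c.alt5, included),   # c_major3 / all included
        "5usageLevel": _ratio(c.count + c.alt3, included),   # c_major5 / all included
    }


example = ExonCounts(count=15, alt3=10, alt5=0, alt3and5=2, skip=0)
result = {name: round(value, 3) for name, value in levels(example).items()}
```

For the Table 1 exon, the rounded values agree with columns 10-13 of the table (0.556, 1.000, 0.556 and 0.926).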
Table 1. Definition of the columns in the output format of HEXEvent for a randomly chosen exon

No.  Column name            Example              Description
1    chromo                 chrX                 Reference sequence chromosome name
2    strand                 +                    + or - for strand
3    start                  101854639            First position of the exon (0-based)
4    end                    101854775            Last position of the exon (1-based)
5    count                  15                   Number of ESTs that include the exon as given in columns [chromo], [strand], [start] and [end]
6    alt3                   10                   Number of ESTs that include the exon with an alternative 3'-splice site
7    alt5                   0                    Number of ESTs that include the exon with an alternative 5'-splice site
8    alt3+5                 2                    Number of ESTs that include the exon with an alternative 3'- and an alternative 5'-splice site simultaneously
9    skip                   0                    Number of ESTs in which the exon is skipped
10   constitLevel           0.556                Constitutive level of the exon (= [count] / ([count] + [alt3] + [alt5] + [alt3+5] + [skip]))
11   inclLevel              1.000                Inclusion level of the exon (= ([count] + [alt3] + [alt5] + [alt3+5]) / ([count] + [alt3] + [alt5] + [alt3+5] + [skip]))
12   3usageLevel            0.556                Usage level of the major 3'-splice site of the exon (= ([count] + [alt5]) / ([count] + [alt3] + [alt5] + [alt3+5]))
13   5usageLevel            0.926                Usage level of the major 5'-splice site of the exon (= ([count] + [alt3]) / ([count] + [alt3] + [alt5] + [alt3+5]))
14   alt3singleCount        10                   Number(s) of ESTs for different alternative 3'-splice sites
15   alt3singleLoc          101854633            Location(s) of alternative 3'-splice sites
16   alt5singleCount        0                    Number(s) of ESTs for different alternative 5'-splice sites
17   alt5singleLoc          #                    Location(s) of alternative 5'-splice sites
18   alt3and5singleCount    2                    Number(s) of ESTs for different alternative 3'- and 5'-splice site combinations
19   alt3and5singleLocName  101854633-101854787  Location(s) of alternative 3'- and 5'-splice site combinations
20   OnlyESTexonCount       0                    Number of ESTs in which the exon is overlapped by an alternative version of it that is not yet included in the human isoform list of the UCSC Genome Browser but has at least one EST supporting it
21   OnlyESTexons           #                    Location(s) of alternative version(s) of the exon that is/are not yet included in the human isoform list of the UCSC Genome Browser but has/have at least one EST supporting it
22   genename               ARMCX5               Name of the gene the exon is part of; if no gene name has been assigned yet, this is indicated by 'onlyEST'



The first four columns describe the location of the exon, whereas columns 5-9 give EST counts for inclusion (as given in columns 1-4), alternative 
splice site usage and exclusion of the exon. Column 10 specifies the constitutive level of the exon. Here, the grade of being constitutive is calculated 
by comparing the occurrence of the exon as specified in columns 1-4 with all other alternative events. In column 11, the inclusion level of the exon is 
given. Here, inclusion is calculated as the sum of ESTs showing the exon with the coordinates given in columns 1-4 and EST counts for alternative 
versions of the exon showing an alternative 3'- and/or 5'-splice site, whereas exclusion is represented by the number of ESTs having this exon
skipped. Columns 12 and 13 show the usage level of the major 3'-splice site and the major 5'-splice site (as given in columns 3 and 4 or columns 4 
and 3 when on the negative strand), respectively. The usage level of the major 3'-splice site of the exon is calculated as the ratio of the number of 
ESTs showing this 3'-splice site, i.e. all ESTs showing the exons as given in columns 1-4 as well as all ESTs showing the exon with an alternative 
5'-splice site, and the number of ESTs that include the exon with any splice site. The usage level of the major 5'-splice site is calculated analogously. 
The location of all alternative 3'-splice sites of the exon can be found in column 15, whereas the EST counts for each single one are given in column 
14. Respective entries can be found in columns 16 and 17 for alternative 5'-splice sites, as well as in columns 18 and 19 for mutually occurring 3'- and 
5'-splice sites. The EST count and location of the new versions of the exon that have EST evidence but are not confirmed events in the UCSC 
Genome Browser yet, are shown in columns 20 and 21. The location is given in the form 'chromosomeSTRANDstart-end'. The last column shows 
the name of the gene the exon is part of. If none was assigned yet, 'onlyEST' is specified. 
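
When processing a downloaded result file, the location strings of the form 'chromosomeSTRANDstart-end' need to be split into their parts. The following Python sketch shows one way to do this. It is a sketch under assumptions: the function name is hypothetical, chromosome names are assumed to start with "chr" as in the Table 1 example, and the sample string is constructed from the Table 1 coordinates for illustration rather than copied from real HEXEvent output.

```python
import re

# chromosome name, then strand (+ or -), then start-end coordinates
LOCATION = re.compile(
    r"^(?P<chrom>chr[^+\-\s]+)(?P<strand>[+-])(?P<start>\d+)-(?P<end>\d+)$"
)


def parse_location(text: str) -> tuple:
    """Split 'chromosomeSTRANDstart-end' into (chrom, strand, start, end)."""
    match = LOCATION.match(text.strip())
    if match is None:
        raise ValueError(f"not a 'chromosomeSTRANDstart-end' location: {text!r}")
    return (match["chrom"], match["strand"],
            int(match["start"]), int(match["end"]))


# Hypothetical location string built from the Table 1 coordinates.
chrom, strand, start, end = parse_location("chrX+101854633-101854787")
```

Empty entries, which Table 1 marks with '#', do not match the pattern and raise a ValueError, so a caller would need to test for '#' first.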



Based on the input, exons are filtered according to the user
settings, and a customized set of exons is extracted and
prepared for download. A summary of the basic workflow of the
creation of the HEXEvent database is shown in Figure 2.



USAGE

To use the HEXEvent database, the user has to specify the
characteristics of the queried exons. These specifications
include defining which types of alternative splicing events
should be investigated and designating an inclusion level to
classify constitutive and alternative exons. Additionally, the
user is asked to specify whether exons are allowed to show
alternative splicing events other than the selected ones. For
each queried exon, HEXEvent reports the genome location, the
type(s) of alternative splicing and their location, as well as
the number of ESTs in support of the events. An EST count is
given for each alternative version.



Figure 2. Workflow during the creation of HEXEvent. We downloaded
the UCSC Genes track, the spliced ESTs track, as well as the
human mRNAs track from the UCSC Genome Browser. Using all three
data sets, we extracted all known versions of human internal
exons. An EST count was assigned to each version of each exon,
specifying inclusion and exclusion levels. In a last step,
overlapping exons were combined and indicated as alternative
versions of each other. [Flowchart: UCSC tracks (UCSC Genes,
spliced ESTs, human mRNAs) -> create list of internal exons (a
valid exon is either found in the UCSC Genes or found in the
ESTs/mRNAs and shows the canonical splice sites) -> assign EST
counts (inclusion and exclusion) -> combine overlapping exons
(= alternative events).]



Input

All input selections are used to specify the type of exons the
user is interested in. First, the basic input can be a genomic
region, a list of genes or the whole genome. Second, the type of
alternative splicing the user is interested in needs to be
defined. The user can choose among options including all exon
types, only constitutive exons, or any combination of cassette,
alternative 3'-splice site, alternative 5'-splice site and
simultaneous alternative 3'- and 5'-splice site exons. Third, the
definition of a constitutive exon can be specified, i.e. the user
decides what exon constitutive level (as defined in 'Database
Generation, Content and Definitions' section) is acceptable for
an exon to still be called constitutive. Furthermore, the user
can restrict the set of alternatively spliced exon events by
defining an upper inclusion level, which, for instance, is useful
when analyzing only low-inclusion exons. Analogously, the user
has the option to restrict the 3'- and/or 5'-splice site usage
level. Fourth, the user chooses whether selected alternative
splicing events should be unique or can occur in combination with
other splicing events. To do so, the user will be asked whether
selected exon types should be 'strict'. Here, a 'non-strict' exon
definition means that the exon has to show at least one of the
selected types of alternative splicing, but it may also be
associated with any of the non-selected types. In contrast, when
a strict exon definition is chosen, the exon has to be involved
in at least one of the selected alternative splicing types, but
must not show any of the non-selected types. The database will
compile a final list of exons based on these input parameters to
be displayed in the browser or to be downloaded and saved to a
file.
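
The difference between the strict and the non-strict exon definition can be expressed compactly with set operations. The following Python sketch is an illustration only, not the database's implementation; the function name and the event labels ("cassette", "alt3", "alt5", "alt3and5") are hypothetical.

```python
def matches_selection(exon_events: set, selected: set, strict: bool) -> bool:
    """Decide whether an exon passes the event-type filter.

    exon_events: alternative splicing types observed for the exon
    selected:    types chosen by the user
    strict:      if True, the exon must not show any non-selected type
    """
    if not (exon_events & selected):
        # The exon has to show at least one selected type in both modes.
        return False
    if strict and (exon_events - selected):
        # Strict mode rejects exons that also show a non-selected type.
        return False
    return True


# An exon with an alternative 3'-splice site that can also be skipped:
exon = {"alt3", "cassette"}
kept_non_strict = matches_selection(exon, {"alt3"}, strict=False)
kept_strict = matches_selection(exon, {"alt3"}, strict=True)
```

Here the exon is kept under the non-strict definition and dropped under the strict one, because skipping is not among the selected types.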



Output 

The output of each query will be a table of all requested 
exons showing the frequencies of their known splicing 
events. Depending on the user choice, the results will be 
displayed in the browser or they will be written to a down- 
loadable text file. All output columns are described in 
Table 1. 



Example applications 

In an example application of the HEXEvent database, we 
are interested in all exons of the gene ARMCX4 and all 
their known splicing events. To get this information, the 
gene name needs to be specified in the input mask of the 
database. Furthermore, an interest in all types of exons, 
i.e. all alternative events, needs to be selected. The refine- 
ment of the exon type as well as a strict or non-strict exon 
definition has no effect on the output because we re- 
quested all exons, no matter what type. Selecting all 
exons overrides any other selections made. The output 
of this query is shown in Table 2. 

If the user is only interested in exons of the gene 
ARMCX4 that have an alternative 3'-splice site, only al- 
ternative 3'-splice site exons should be selected in the exon- 
type specification. If a strict exon-type definition is chosen, 
exons that have an alternative 3'-splice site, but can also be 
skipped or have an alternative 5'-splice site will not be 
shown (Table 2). To show those, a non-strict exon defin- 
ition should be chosen (Table 2). 

In case the user requests only constitutive exons, a con- 
stitutive exon needs to be defined. In the most conservative 
scenario, where the user allows no alternative events to 
call an exon constitutive, the HEXEvent database will 
report the five exons shown in Table 3. In contrast, if
the user allows up to 5% of the ESTs to show alternative
splicing, the database will report one additional exon that
shows one EST with an alternative 3'-splice site in
addition to 45 ESTs including the major version of the
exon (Table 3).
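
The effect of the user-defined threshold in this example can be reproduced with a short Python sketch. It assumes that the threshold is expressed as the maximum fraction of ESTs that may show an alternative event and that the comparison is inclusive; the function and parameter names are hypothetical, and treating an exon without ESTs as non-constitutive is likewise an assumption.

```python
def is_constitutive(count: int, alt3: int = 0, alt5: int = 0,
                    alt3and5: int = 0, skip: int = 0,
                    max_alt_fraction: float = 0.0) -> bool:
    """Return True if the constitutive level reaches 1 - max_alt_fraction."""
    total = count + alt3 + alt5 + alt3and5 + skip
    if total == 0:
        return False  # assumption: no EST support, no classification
    constit_level = count / total
    return constit_level >= 1.0 - max_alt_fraction


# The additional exon from the example: 45 ESTs with the major version and
# one EST with an alternative 3'-splice site (constitutive level 45/46).
strictest = is_constitutive(count=45, alt3=1, max_alt_fraction=0.0)
relaxed = is_constitutive(count=45, alt3=1, max_alt_fraction=0.05)
```

With a constitutive level of 45/46 (about 0.978), the exon fails the strictest definition but passes once up to 5% of the ESTs may show alternative splicing.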



DISCUSSION 

HEXEvent allows users to apply their own definition of 
alternative/constitutive exons and generates a list of exons 
matching the input criteria. For each queried exon,
HEXEvent reports the genome location, the type(s) of
alternative splicing, their location and the EST counts
supporting each alternative version. Based on these entries,
exon inclusion and splice site usage levels are reported.
The HEXEvent database is a valuable tool to customize 
future bioinformatic analyses of alternative splicing. While 
HEXEvent is currently based on UCSC Genome Browser 
and available EST information, we plan to add alternative 
splicing and exon inclusion data derived from deep 
sequencing reactions in the very near future. 
Furthermore, we intend to expand the database to other 
species. 

Comparison with other databases: while the UCSC 
Genome Browser offers sets of alternatively spliced 
exons for download (15), those sets do not include inclu- 
sion/usage-level information and no set of constitutively 
spliced exons is made available. Furthermore, all available 
sets of alternatively spliced exons have a splice event 
centric view, i.e. they include all exons that show a 
certain splicing event, but do not list all splicing events 
for an individual exon. ASPicDB is a database providing 
information about the splicing pattern of human genes 
(14). While ASPicDB also reports a list of exons in a 
certain region or gene, this list cannot be downloaded. 
In contrast, the lists of exons generated by HEXEvent are
downloadable to provide the foundation for subsequent 
bioinformatic analyses. Thus, HEXEvent is suitable for 
local single-gene analyses as well as for more complex or 
genome-wide analyses using downloaded lists. 
Furthermore, in HEXEvent, users can set their own def- 
inition of constitutive and alternative exons by specifying 
inclusion levels up to which an exon is considered a 
member of either category. Finally, HEXEvent summarizes
alternative splice site versions of the same exon in one entry,
thereby specifying all possible alternative splice sites. In
contrast, ASPicDB has one entry per alternative splice site
version, which inherently makes the output harder to view and
process. While this is a comparison with the most recent
alternative splicing database published in NAR, it is worth
noting that several other useful databases were published
earlier, such as Hollywood (17) and ASD/ASTD (18). The
Alternative Splicing Database/Alternative Splicing and
Transcript Diversity database (ASD/ASTD) was closed at the
beginning of 2012, and its features have been integrated into
Ensembl (19). These databases are excellent venues for
interrogating the splicing patterns of individual genes,
especially in light of their excellent accompanying graphics.
However, all of the existing databases are limited by (i) not
offering the ability to download multiple splicing events at a
time and (ii) not permitting user definitions of alternative
splicing. The most useful features of HEXEvent bridge this gap,
thus permitting users to custom design genome-wide exon data
sets.



AVAILABILITY 

The HEXEvent database is freely available at
http://hexevent.mmg.uci.edu and open to all users. There is no
login requirement.